perm filename JHTALK[KI,ALS] blob sn#097066 filedate 1974-04-14 generic text, type T, neo UTF8
00100	The Stanford AI Pitch-Synchronous Fourier-Transform Formant Extractor
00200	
00300	The formant extractor is not a formant tracker in the usual sense since
00400	a fresh determination of the formant locations is made for each segment
00500	independently. This is thought to be desirable as it reveals
00600	rapid changes in formant location, particularly in the vicinity of
00700	obstruents where the character of the obstruent is frequently revealed
00800	more by these rapid transitions than by anything else. Only after this
00900	has been done is any attempt made to reconcile data for adjacent
01000	segments, as will be explained later.
01100	
01200	Formant identification is based on the use of Fourier transforms using
01300	single pitch period segments where the segment starts and ends at the
01400	zero crossing which precedes the maximum excursion in amplitude.
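
The choice of segment boundary can be sketched as follows. This is only an illustration under stated assumptions: the function name segment_start, the list-of-samples input, and the arguments giving the start and length of the pitch period are all hypothetical, and the pitch period is taken as already determined by a separate procedure.

```python
def segment_start(samples, period_start, period_len):
    """Return the index of the first sample after the zero crossing
    that precedes the largest excursion in the given pitch period."""
    end = min(period_start + period_len, len(samples))
    # Locate the sample of maximum absolute amplitude within the period.
    peak = max(range(period_start, end), key=lambda i: abs(samples[i]))
    # Walk backward while the samples keep the same sign as the peak.
    i = peak
    while i > 0 and samples[i - 1] * samples[peak] > 0:
        i -= 1
    return i
```

Applying the same function to the following pitch period would give the end of the segment, so that each segment runs from one such zero crossing to the next.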
01500	
01600	A study has been made of the effects of the segment location within the
01700	period and of the effect of the segment length. In general, cleaner
01800	transforms are produced when the segment length is somewhat less than
01900	the full period; 80% seems to be a reasonable compromise between
02000	cleanness and unwarranted broadening of the peaks in the spectrum because
02100	of insufficient points of data. However, it is questioned whether this
02200	is a reasonable thing to do since the location of the formant peaks is
02300	affected by the glottal loading during the latter part of the period
02400	and this is, of course, removed. It seems more reasonable to assume that
02500	the speaker modifies the shape of his upper vocal tract to compensate
02600	for his own peculiar glottal loading effects since he attempts to produce
02700	sounds that match those produced by others and it is highly unlikely
02800	that the ear can do anything to disambiguate glottal coupling effects.
02900	It is observed that this glottal loading effect is more pronounced
03000	for pitch periods that happen to be longer than the average.
03100	To all appearances it seems that most speakers delay
03200	the closing of the glottis rather than lengthening the closed time
03300	when they drop the pitch of their voice. A reasonable thing
03400	to do thus seems to be to use the full period for intervals that
03500	are normal or shorter and to restrict the length to the average
03600	length for long periods.
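
The length rule just stated amounts to a simple clamp. In the sketch below the name segment_length is hypothetical, lengths are assumed to be in samples, and the way the average period is obtained (for example a running mean over recent periods) is left open since the text does not specify it.

```python
def segment_length(period_len, average_len):
    """Use the full period when it is normal or short; otherwise
    restrict the segment to the average period length."""
    return min(period_len, average_len)
```

For an average period of 200 samples, a 180-sample period would be used in full while a 240-sample period would be cut to 200 samples.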
03700	
03800	The location of the formant peaks
03900	is also shifted somewhat by shifts in the starting point in the period,
04000	since windowing attenuates contributions to the transform from the
04100	edge portions of the data, but this effect is small compared with
04200	the greater ease with which the peaks can be located when the
04300	starting location described above is used.
04400	
04500	The first operation is to locate the largest proper peaks found in
04600	each of six regions, these being the usual ranges for the first five
04700	formants and the region below the usual lower limit for the first
04800	formant. These limits are shifted between male and female voices, but
04900	in general we have not found it necessary to adjust them for the
05000	specific speaker. A proper peak is defined as the largest local maximum
05100	in the region that is bounded on both sides by points
05200	that are of lesser amplitude. If the five points for the five formant
05300	regions are distinct, that is, no two are assigned the
05400	same value, the points are accepted as is, subject to a final
05500	medial smoothing operation which will be described later.
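
The definition of a proper peak can be illustrated with a short sketch. The function name largest_proper_peak and the representation of the spectrum as a list of amplitudes indexed by frequency bin are assumptions; so is the choice to let the bounding neighbors lie just outside the region.

```python
def largest_proper_peak(spectrum, lo, hi):
    """Return the bin index of the largest local maximum in bins lo..hi
    whose neighbors on both sides are of lesser amplitude, or None."""
    best = None
    first = max(lo, 1)
    last = min(hi, len(spectrum) - 2)
    for i in range(first, last + 1):
        # A proper peak must be strictly higher than both neighbors.
        if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]:
            if best is None or spectrum[i] > spectrum[best]:
                best = i
    return best
```

Note that a region whose largest value sits on a slope still rising at the region edge yields no proper peak at that point, which is the purpose of the definition.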
05600	
05700	Since the ranges for the formants overlap, frequent conflicts occur
05800	and these must now be resolved. This is done starting at the low
05900	frequency end. Somewhat different strategies are used for different
06000	possible conflicts.
06100	
06200	Should the first and second formant identifications
06300	conflict, searches are made for the next largest proper peaks, to the
06400	low frequency side extending the region to zero, and to the high
06500	frequency side to the upper limit of the F2 band. The amplitudes of these
06600	two new peaks and their positions with respect to
06700	median values for the F1 and F2 regions are then compared. Actually
06800	a decision made on the basis of amplitude only, allowing a 6 dB credit
06900	for the higher frequency peak, seems to make the right decision almost
07000	always. A study will be made of this matter when a larger sample of data
07100	becomes available.
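
The amplitude-only decision might be sketched as below. The interpretation is an assumption: if the high-side peak wins after its 6 dB credit, the disputed peak is kept as F1 and the new peak becomes F2; otherwise the low-side peak becomes F1 and the disputed peak becomes F2. The name resolve_f1_f2 and the (frequency, amplitude in dB) pairs are hypothetical.

```python
def resolve_f1_f2(disputed, low_alt, high_alt, credit_db=6.0):
    """Each argument is a (frequency_hz, amplitude_db) pair, or None
    when no proper peak was found on that side.
    Returns an (F1, F2) pair, or None if the conflict is unresolved."""
    if low_alt is None and high_alt is None:
        return None
    if low_alt is None:
        return disputed, high_alt
    if high_alt is None:
        return low_alt, disputed
    # Amplitude comparison with a credit for the higher frequency peak.
    if high_alt[1] + credit_db >= low_alt[1]:
        return disputed, high_alt
    return low_alt, disputed
```

The credit is kept as a parameter since the text describes the 6 dB figure as provisional pending a larger sample of data.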
07200	
07300	Having resolved the conflict between F1 and F2, attention is then
07400	directed to a possible conflict between F2 and F3 which may have been
07500	introduced by the resolution of the F1-F2 conflict or which may have been
07600	there initially. If a conflict is newly introduced then a second look
07700	is given to the F1-F2 conflict. Recourse is now had to a procedure
07800	to locate a possible F2 peak that had been obscured by a dominant
07900	F1 peak. The approximate shape of the original F1-F2 peak is assumed
08000	to be parabolic, as determined from three data points: the
08100	point at the maximum and the points nearest the two 3 dB down values.
08200	A fresh attempt is made to locate a new peak between the location of the
08300	disputed peak, which is now removed from the data, and the location
08400	previously found for F3. If such a peak is found it is assigned to
08500	F2 and attention is shifted to a possible F3-F4 conflict.
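
One way the removal of the dominant peak could be carried out is sketched here. Several things are assumed and not taken from the text: the spectrum is held in dB, the 3 dB points are taken as the first bins on each side at or below the 3 dB down level, and "removal" is done by subtracting the fitted parabola. All function names are hypothetical.

```python
def fit_parabola(p0, p1, p2):
    """Return y(x) for the parabola through three (x, y) points (Lagrange form)."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2

    def y(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))

    return y


def three_db_points(spectrum_db, peak):
    """Return the first bins on each side at or below 3 dB under the peak
    (or the edge of the data if that level is never reached)."""
    floor = spectrum_db[peak] - 3.0
    left = peak
    while left > 0 and spectrum_db[left] > floor:
        left -= 1
    right = peak
    while right < len(spectrum_db) - 1 and spectrum_db[right] > floor:
        right += 1
    return left, right


def remove_peak(spectrum_db, peak):
    """Subtract the parabola fitted to the dominant peak from the spectrum."""
    left, right = three_db_points(spectrum_db, peak)
    if not left < peak < right:
        raise ValueError("peak too close to the edge of the data")
    model = fit_parabola((left, spectrum_db[left]),
                         (peak, spectrum_db[peak]),
                         (right, spectrum_db[right]))
    return [value - model(i) for i, value in enumerate(spectrum_db)]
```

A peak hidden on the shoulder of the dominant one would then show up as a positive bump in the residual between the disputed peak and the F3 location.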
08600	
08700	Should an initial conflict be found between F2 and F3, this is resolved
08800	in essentially the same way except that no attempt is made to find
08900	a possible hidden F3 as was done for F2. Instead, if a conflict between
09000	F4 and F5 is produced by the resolution of an F3-F4 conflict then this
09100	is resolved just as if it were an initial conflict.
09200	
09300	
09400	Under certain circumstances it seems to be impossible to resolve all
09500	conflicts by the procedures just described. When this occurs the failure
09600	to locate a proper peak is signaled by storing a zero for the formant in
09700	question and the program proceeds to the next formant. On the completion
09800	of this first go-around a second look is given to any zero values, and
09900	finally, if still unresolved, each zero is replaced by the value found
10000	for the formant in question in the previous time slot.
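
The final fallback can be sketched as follows, assuming the formant values are held as one list per time slot with zero marking an unresolved formant. The name fill_unresolved is hypothetical, and it is assumed that the previous slot's value is taken after that slot has itself been filled.

```python
def fill_unresolved(tracks):
    """tracks: list of per-slot lists of formant values, 0 = unresolved.
    Returns a copy in which each zero is replaced by the value of the
    same formant in the previous time slot, where one exists."""
    filled = []
    for slot in tracks:
        row = list(slot)
        for k, value in enumerate(row):
            if value == 0 and filled:
                row[k] = filled[-1][k]
        filled.append(row)
    return filled
```

A zero in the very first time slot has no predecessor and is left as it stands in this sketch.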
10100	
10200	Having resolved all conflicts in this way, the exact locations of the
10300	peaks are refined by parabolic interpolation based on the positions
10400	of the highest point and its two nearest neighbors. It is doubtful
10500	if the greater precision which results from this operation is at all
10600	needed, at least in the case of 512 point transforms on 20,000 hertz
10700	data. Nevertheless, at least 2 bits of added precision can be obtained, and
10800	the greatly improved smoothness of the resulting formant tracks seems
10900	to indicate that a corresponding increase in accuracy has resulted.
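
The interpolation step is the usual three-point parabolic fit, sketched below. The function name refine_peak is hypothetical, and the default bin spacing assumes that the 20,000 hertz figure is the sampling rate, which gives 20000/512, or about 39 hertz per bin; two added bits would then correspond to a resolution of roughly 10 hertz.

```python
def refine_peak(spectrum, k, bin_hz=20000.0 / 512):
    """Refine the location of the peak at bin k by fitting a parabola
    through the peak and its two nearest neighbors.
    Returns the interpolated peak frequency in hertz."""
    a, b, c = spectrum[k - 1], spectrum[k], spectrum[k + 1]
    denom = a - 2.0 * b + c
    # The vertex of the parabola lies at this fractional offset from bin k.
    offset = 0.0 if denom == 0 else 0.5 * (a - c) / denom
    return (k + offset) * bin_hz
```

For a true peak the offset always lies between minus one half and plus one half of a bin.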
11000	
11100	The procedures so far described result in very good formant tracks.
11200	However, there are still isolated points which appear to be out of line.
11300	Most of these appear to be situations where a person would be quite
11400	unable to make an assured decision. A certain few can be traced to
11500	failures in the pitch period determining procedure while others are
11600	due to more obscure reasons. In almost all cases these abnormalities
11700	persist for but a single pitch period and they can be corrected by
11800	a final process of medial smoothing. This is done in one direction only:
11900	going forward in time, each value for each formant is replaced by the
12000	median value of the point in question, its predecessor (as already
12100	corrected) and its successor. Individual points which lie between
12200	their neighbors are not altered by this procedure. Errant points are
12300	replaced by values for the nearest neighbor. This procedure does have
12400	the effect of removing true extrema but an extremum which persists for
12500	but a single pitch period probably does not contain much phonetic
12600	information and can probably be ignored. One could make allowances for
12700	true extrema by applying the medial smoothing only to points that
12800	lie more than, say, 2 dB away from their nearest neighbor. This
12900	refinement seems entirely unnecessary but it is being kept in reserve.
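
The smoothing pass described above can be sketched as a forward median of three. The name medial_smooth and the representation of one formant track as a list of values, one per pitch period, are assumptions; the optional threshold refinement is not included.

```python
def medial_smooth(track):
    """Forward median-of-three smoothing of a single formant track.
    Each interior value is replaced by the median of its already
    corrected predecessor, itself, and its uncorrected successor."""
    out = list(track)
    for i in range(1, len(out) - 1):
        out[i] = sorted((out[i - 1], track[i], track[i + 1]))[1]
    return out
```

With the track 500, 510, 900, 530, 540 the errant value 900 is replaced by 530, the nearer of its two neighbors in value, and the remaining points are unchanged.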
13000	
13100	The advantages of this method of formant extraction over other more
13200	conventional tracking procedures seem to lie in the much improved
13300	results in the vicinity of obstruents where the rapid changes in formant
13400	location can be masked by tracking and where information as to the nature
13500	of the obstruent is contained in this transition region.